Steve Elston
10/27/2022
Bayesian analysis contrasts with frequentist methods
The objective of Bayesian analysis is to compute a posterior distribution
This contrasts with frequentist statistics, where the objective is to compute a point estimate and confidence interval from a sample
Bayesian models allow expressing prior information as a prior distribution
The posterior distribution is said to quantify our current belief
Inference can be performed on the posterior distribution by finding the maximum a posteriori (MAP) value and a credible interval
Predictions are made by simulating from the posterior distribution
Two functions must be defined to compute the posterior distribution: the prior and the likelihood
How can we extend Bayes models to more complex problems?
For simple problems we can use a conjugate prior and posterior
What about problems with complex likelihood?
Need to perform sampling to compute approximation of posterior
Highly efficient sampling methods required for complex problems
Real-world Bayes models have large numbers of parameters, into the millions
Naive approach is simple grid sampling
Consider this thought experiment: sample each dimension of the parameter space at 100 grid points
Computational complexity of grid sampling has exponential scaling with dimensionality
Need a better approach!
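The thought experiment in numbers: with 100 grid points per dimension, the number of posterior evaluations is \(100^d\) for \(d\) dimensions:

```python
# Grid sampling cost: 100 points per dimension requires 100**d
# posterior evaluations in d dimensions - exponential scaling.
for d in (1, 2, 5, 10):
    print(f"dimensions = {d}, grid points = {100**d}")
```

Even at 10 dimensions the grid requires \(10^{20}\) evaluations; millions of parameters are hopeless.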
How can we scale Bayesian models to 1000s of parameters?
Large-scale Bayesian models need highly efficient sampling methods
Markov chain Monte Carlo (MCMC) sampling is efficient and scalable
Rather than sampling on a grid, MCMC methods sample distributions efficiently
Requires effort to understand how the algorithm works
MCMC sampling uses a Markov process sampling chain
\(X_t\) is the state vector of state probabilities for \(N\) possible states
\[X_t = [x_1, x_2, x_3, \ldots, x_N]\]
A Markov process is a stochastic process that transitions from the current state probabilities, \(X_t\), to the next state probabilities, \(X_{t+1}\), according to the transition probabilities \(\Pi\)
Key properties of a Markov process:
Since the transition depends only on the current state, a Markov process is memoryless
\[p(X_{t + 1}| X_t, X_{t-1}, X_{t-2}, \ldots, X_0) = p(X_{t + 1}| X_t)\]
For a system with \(N\) possible states we can write the transition probability matrix, \(\Pi\):
\[\begin{align} \Pi &= \begin{bmatrix} \pi_{1,1} & \pi_{1,2} & \cdots & \pi_{1,N}\\ \pi_{2,1} & \pi_{2,2} & \cdots & \pi_{2,N}\\ \vdots & \vdots & \ddots & \vdots \\ \pi_{N,1} & \pi_{N,2} & \cdots & \pi_{N,N} \end{bmatrix}\\ &where\\ \pi_{i,j} &= probability\ of\ transition\ from\ state\ x_j\ to\ state\ x_i\\ &and\\ \pi_{i,i} &= probability\ of\ staying\ in\ state\ x_i\\ &further\\ \pi_{i,j} &\ne \pi_{j,i}\ in\ general \end{align}\]
To make the foregoing more concrete let’s construct a simple example. We will start with a system of 3 states, \(\{ x_1, x_2, x_3 \}\). The transition matrix is:
\[\Pi = \begin{bmatrix} \pi_{1,1} & \pi_{1,2} & \pi_{1,3}\\ \pi_{2,1} & \pi_{2,2} & \pi_{2,3}\\ \pi_{3,1} & \pi_{3,2} & \pi_{3,3} \end{bmatrix} = \begin{bmatrix} 0.5 & 0.0 & 0.6\\ 0.2 & 0.3 & 0.4\\ 0.3 & 0.7 & 0.0 \end{bmatrix} \]
Let’s apply the transition probability matrix to an initial state vector
\[\vec{X}_{t+1} = \Pi\ \vec{X}_t = \begin{bmatrix} 0.5 & 0.0 & 0.6\\ 0.2 & 0.3 & 0.4\\ 0.3 & 0.7 & 0.0 \end{bmatrix} \begin{bmatrix} 1\\ 0\\ 0 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0.2 \\ 0.3 \end{bmatrix} \]
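This single step can be checked numerically as a matrix-vector product:

```python
import numpy as np

# The single transition step above: start with certainty of being in
# state x_1 and apply the transition matrix Pi once.
Pi = np.array([[0.5, 0.0, 0.6],
               [0.2, 0.3, 0.4],
               [0.3, 0.7, 0.0]])
x_t = np.array([1.0, 0.0, 0.0])
x_next = Pi @ x_t   # picks out the first column of Pi
```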
The foregoing is a single step of a Markov process
What happens when there is a series of transitions?
A sequence of such transitions is known as a Markov chain
There are two classes of Markov Chains
Episodic Markov chains have a terminal state
Continuous Markov chains have no terminal state
Markov chain comprises a number of state transitions
Chain of \(n\) state transitions, \(\{t_1, t_2, t_3, \ldots, t_n \}\)
Each transition has the probabilities given by the state transition matrix, \(\Pi\)
To estimate the probabilities of being in the states we use a special case known as a stationary Markov chain
Start with an initial state, \(\vec{X}_0\), for a continuous Markov chain:
\[\Pi\ \Pi\ \Pi\ \ldots \Pi\ \vec{X}_0 = \Pi^n\ \vec{X}_0 \xrightarrow[\text{$n \rightarrow \infty$}]{} p(\vec{X})\]
We can estimate the probabilities of the states even without knowing the values of the transition matrix, \(\Pi\)!
As long as we can repeatedly sample the stochastic Markov process, we can estimate the state probabilities
This is the key to Markov Chain Monte Carlo sampling!
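A quick numerical check using the 3-state example above: repeatedly applying \(\Pi\) converges to the stationary distribution, and simply simulating the process recovers the same probabilities:

```python
import numpy as np

# Repeatedly applying Pi drives any starting state vector to the
# stationary distribution p(X), illustrating Pi^n X_0 -> p(X).
Pi = np.array([[0.5, 0.0, 0.6],
               [0.2, 0.3, 0.4],
               [0.3, 0.7, 0.0]])
x = np.array([1.0, 0.0, 0.0])
for _ in range(50):
    x = Pi @ x
stationary = x    # approximately [0.385, 0.294, 0.321]

# The same probabilities can be estimated purely by sampling the
# process state by state, never using Pi in matrix form.
rng = np.random.default_rng(42)
state, counts = 0, np.zeros(3)
for _ in range(20_000):
    state = rng.choice(3, p=Pi[:, state])   # sample the next state
    counts[state] += 1
empirical = counts / counts.sum()
```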
The first MCMC sampling algorithm developed was the Metropolis-Hastings algorithm, often referred to simply as the Metropolis or M-H algorithm.
Pick a starting point in the parameter space
Choose a nearby point in parameter space randomly from a sampling distribution
Evaluate the posterior at this point
Use the following decision rule to accept or reject the new sample:
If the sample is accepted, compute the posterior density at the new sample point
Repeat steps 2 through 5 many times, until convergence
M-H algorithm uses the following decision rule to accept or reject the new sample:
If the likelihood, \(p(data | parameters)\), of the new point is greater than the current point, accept new point
If the likelihood of the new point is less than the current point, accept with probability according to the ratio:
\[Acceptance\ probability\ = \frac{p(data | new\ parameters)}{p(data | previous\ parameters)}\]
Eventually, the M-H algorithms converges to the posterior distribution
M-H random sampling algorithm is far more sample efficient than naive grid sampling
Consider that the M-H algorithm probabilistically samples the parameter space
Important properties of the Metropolis-Hastings MCMC algorithm include:
Poor convergence arises from low sample efficiency
Algorithm must be ‘tuned’ to ensure sample efficiency
Tuning finds a good dispersion parameter value for the state sampling distribution
Dispersion parameter determines the size of the jumps the algorithm makes
Example, for Normal sampling distribution pick the variance \(\sigma^2\)
Let’s try a simple example, find an estimate of the posterior density of a bivariate Normal distribution
\[\mu = \begin{bmatrix} 0.5\\ 0.5 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} 1 & 0.6\\ 0.6 & 1 \end{bmatrix}\]
Random samples from a bivariate Normal distribution
And, the marginal distributions of the variables
Marginal distributions of bivariate Normal samples
Now, we are ready to sample these data using the M-H MCMC algorithm
The M-H algorithm is applied to sample these data
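A minimal numpy sketch of the M-H sampler for this bivariate Normal target (the step size, seed, and starting point are illustrative choices). Note that only an unnormalized density is needed, since the normalizing constant cancels in the acceptance ratio:

```python
import numpy as np

# Target: bivariate Normal with the mean and covariance of the example
mu = np.array([0.5, 0.5])
prec = np.linalg.inv(np.array([[1.0, 0.6], [0.6, 1.0]]))

def log_density(x):
    d = x - mu
    return -0.5 * d @ prec @ d   # unnormalized log density

def metropolis_hastings(n_samples, step=0.5, seed=42):
    rng = np.random.default_rng(seed)
    samples = np.empty((n_samples, 2))
    current = np.zeros(2)                 # arbitrary starting point
    current_logd = log_density(current)
    accepted = 0
    for i in range(n_samples):
        # Symmetric Normal proposal centered on the current point
        proposal = current + rng.normal(scale=step, size=2)
        proposal_logd = log_density(proposal)
        # Accept with probability min(1, density ratio)
        if np.log(rng.uniform()) < proposal_logd - current_logd:
            current, current_logd = proposal, proposal_logd
            accepted += 1
        samples[i] = current
    return samples, accepted / n_samples

samples, acceptance_rate = metropolis_hastings(10_000)
```

Discarding an initial burn-in segment, the sample means approach \((0.5, 0.5)\).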
MCMC samples from bivariate Normal distribution
Plot the first 500 samples
MCMC samples from bivariate Normal distribution
Notice the long ‘tail’ on the sampled distribution from the burn-in period.
Marginal distributions of the MCMC samples, less first 500
First 500 MCMC samples from bivariate Normal distribution
These marginals are similar to the original distribution
How can we understand the convergence properties of the M-H MCMC sampler?
MCMC sampling generally converges to the underlying distribution, but convergence can be slow
In some pathological cases, convergence may not occur at all
The acceptance rate and rejection rate are key convergence statistics for the M-H algorithm
For our example these statistics are fairly good:
\[Acceptance\ rate = 0.81\] \[Rejection\ rate = 0.19\]
A trace plot displays the sampled value of each parameter against the sample number
Trace plots for two variables from the M-H algorithm
The traces show sampling around the highest density values of the parameters, indicating good convergence
An autocorrelation plot shows how a sample value is related to the previous samples
Autocorrelation of Markov chain for the two parameters
Notice that the autocorrelation dies off fairly quickly with lag
Low autocorrelation indicates good sampling efficiency
We can relate sampling efficiency to the autocorrelation of the samples
\[ESS = \frac{N}{1 + 2 \sum_k ACF(k)}\]
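The ESS formula can be computed directly from an empirical ACF. A sketch using a synthetic AR(1) chain with known lag-1 correlation (the 0.05 truncation threshold is an illustrative choice):

```python
import numpy as np

def autocorr(x, max_lag):
    """Empirical autocorrelation of a 1-D chain, lags 1..max_lag."""
    x = x - x.mean()
    var = np.dot(x, x) / len(x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / (len(x) * var)
                     for k in range(1, max_lag + 1)])

def effective_sample_size(x, max_lag=100):
    """ESS = N / (1 + 2 * sum_k ACF(k))."""
    rho = autocorr(x, max_lag)
    # Truncate the sum where the ACF first becomes negligible
    positive = rho > 0.05
    cutoff = len(rho) if positive.all() else np.argmin(positive)
    return len(x) / (1.0 + 2.0 * rho[:cutoff].sum())

# AR(1) chain with lag-1 correlation 0.5: theoretical ESS ~ N / 3
rng = np.random.default_rng(0)
n, rho1 = 20_000, 0.5
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    x[t] = rho1 * x[t - 1] + rng.normal() * np.sqrt(1 - rho1**2)
ess = effective_sample_size(x)
```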
A number of other powerful MCMC sampling algorithms have been developed
The Gibbs sampler is an improved MCMC sampler which speeds convergence
Named for the 19th Century physicist Josiah Willard Gibbs; inspired by statistical mechanics
The Gibbs sampler samples each dimension of the parameter space sequentially in a round-robin manner
M-H algorithm attempts jumps across all dimensions of the parameter space.
Compared to the M-H, Gibbs sampling reduces serial correlation through round-robin sampling
Update along each dimension approximately orthogonal to the preceding sampled dimension
There are no tuning parameters, since sampling is based on the full conditional distributions of the parameters.
The basic Gibbs sampler algorithm has the following steps:
For an N dimensional parameter space, \(\{ \theta_1, \theta_2, \ldots, \theta_N \}\), find a random starting point
In order, \(\{1, 2, 3, \ldots, N\}\), assign the next dimension to sample, starting with dimension \(1\); the actual order is not important
Sample from the full conditional distribution of that parameter given the observations, \(D\), and the other parameter values: \(p(\theta_1|D, \theta_2, \theta_3, \ldots, \theta_N)\)
Repeat steps 2 and 3 until convergence
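For the earlier bivariate Normal example, each full conditional is itself Normal, so the Gibbs steps can be sampled directly. A minimal sketch:

```python
import numpy as np

# Gibbs sampling for a bivariate Normal with mean (0.5, 0.5), unit
# variances, and correlation rho = 0.6. The full conditional of each
# coordinate is Normal with mean mu_i + rho*(x_j - mu_j) and
# variance 1 - rho**2.
def gibbs_bivariate_normal(n_samples, mu=(0.5, 0.5), rho=0.6, seed=0):
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho**2)        # conditional standard deviation
    samples = np.empty((n_samples, 2))
    x1, x2 = 0.0, 0.0                 # arbitrary starting point
    for i in range(n_samples):
        # Sample each dimension in turn from its full conditional
        x1 = rng.normal(mu[0] + rho * (x2 - mu[1]), sd)
        x2 = rng.normal(mu[1] + rho * (x1 - mu[0]), sd)
        samples[i] = (x1, x2)
    return samples

samples = gibbs_bivariate_normal(10_000)
```

No dispersion parameter needs tuning: the jump scale comes from the conditional distributions themselves.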
The Hamiltonian sampler was proposed in 1987 (Duane, et al.) and uses a simple idea from classical mechanics: the Hamiltonian, \(\mathtt{H}\), is the total energy, the sum of kinetic energy \(\mathtt{K}\) and potential energy \(\mathtt{V}\)
\[\mathtt{H}(q,p) = \mathtt{K}(p,q) + \mathtt{V}(q)\]
We can relate position and momentum to the density we want to sample
The Boltzmann distribution of the Hamiltonian
\[p(q,p) \propto e^{-\frac{\mathtt{H}(q,p)}{kT}}\]
\[\begin{align} \frac{d\ q}{d\ t} &= \frac{\partial \mathtt{H}}{\partial p} = \frac{\partial \mathtt{K}}{\partial p}\\ \frac{d\ p}{d\ t} &= -\frac{\partial \mathtt{H}}{\partial q} = -\frac{\partial \mathtt{V}}{\partial q} \end{align}\]
Hamiltonian MCMC is an extension of the M-H algorithm, using Hamiltonian dynamics to propose new samples
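A minimal sketch of Hamiltonian MCMC for a one-dimensional standard Normal target, with \(\mathtt{V}(q) = q^2/2\) and \(\mathtt{K}(p) = p^2/2\) (the leapfrog step size and trajectory length are illustrative choices):

```python
import numpy as np

# 1-D HMC: integrate Hamilton's equations with the leapfrog scheme,
# then accept/reject on the change in total energy H = K + V.
def hmc(n_samples, step=0.2, n_leapfrog=20, seed=1):
    rng = np.random.default_rng(seed)
    V = lambda q: 0.5 * q**2          # negative log density
    grad_V = lambda q: q
    samples = np.empty(n_samples)
    q = 0.0
    for i in range(n_samples):
        p = rng.normal()              # resample momentum each iteration
        q_new, p_new = q, p
        # Leapfrog: half momentum step, alternating full steps, half step
        p_new -= 0.5 * step * grad_V(q_new)
        for _ in range(n_leapfrog - 1):
            q_new += step * p_new
            p_new -= step * grad_V(q_new)
        q_new += step * p_new
        p_new -= 0.5 * step * grad_V(q_new)
        # Metropolis accept/reject on the energy change
        dH = (V(q_new) + 0.5 * p_new**2) - (V(q) + 0.5 * p**2)
        if rng.uniform() < np.exp(-dH):
            q = q_new
        samples[i] = q
    return samples

samples = hmc(5000)
```

Because the leapfrog integrator nearly conserves energy, the acceptance rate stays near one even for long, decorrelating jumps.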
NUTS represents the state of the art in MCMC samplers: Hoffman and Gelman 2014
The PyMC3 package uses the NUTS MCMC algorithm by default.
NUTS improves on HMC
Why even discuss other samplers when we have NUTS?
Hamiltonian MCMC is a complex algorithm - some resources for additional details
Steps of modern Bayes modeling and evaluation
Packages like ArviZ are dedicated to evaluation of Bayes MCMC models.
Multiple example diagnostics for the breakdown of MCMC sampling can be found in an online appendix to the seminal paper by Vehtari et al., 2019
Define and construct a simple linear regression model
Data for regression example
The model has 4 parameters, \([\beta_0, \beta_1, \beta_2, \sigma]\)
Define prior distributions for these parameters:
\[\begin{align} \beta_0 &\sim \mathtt{N}(0, 2)\\ \beta_1 &\sim \mathtt{N}(0, 2)\\ \beta_2 &\sim \mathtt{N}(0, 2)\\ \sigma &\sim |\mathtt{N}(0, 1)| \end{align}\]
Likelihood model, with \(\mu = \beta_0 + \beta_1 x_1 + \beta_2 x_2\):
\[Y \sim \mathtt{N}(\mu, \sigma)\]
with pymc3.Model() as regression_model:
    # Priors for unknown model parameters
    betas = pymc3.Normal("betas", mu=0, sigma=2, shape=3)
    sigma = pymc3.HalfNormal("sigma", sigma=1)
    # Deterministic expected value of outcome
    mu = betas[0] + betas[1] * x1 + betas[2] * x2
    # Likelihood (sampling distribution) of observations
    Y_obs = pymc3.Normal("Y_obs", mu=mu, sigma=sigma, observed=Y)

Prior predictive checks to verify prior distribution choice
Prior predictive response vs response variable distribution
Prior density of Betas and sigma
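A prior predictive check can be sketched in plain numpy: draw parameters from the priors, then simulate responses. The predictors `x1` and `x2` here are simulated stand-ins for the example's data:

```python
import numpy as np

# Prior predictive simulation: parameters from the priors above,
# responses from the Normal likelihood. x1, x2 are hypothetical
# predictors standing in for the example's observed data.
rng = np.random.default_rng(0)
n_draws, n_obs = 1000, 50
x1 = rng.normal(size=n_obs)
x2 = rng.normal(size=n_obs)

betas = rng.normal(0.0, 2.0, size=(n_draws, 3))     # beta_0..beta_2 ~ N(0, 2)
sigma = np.abs(rng.normal(0.0, 1.0, size=n_draws))  # sigma ~ |N(0, 1)|

mu = betas[:, [0]] + betas[:, [1]] * x1 + betas[:, [2]] * x2
y_prior_pred = rng.normal(mu, sigma[:, None])       # shape (n_draws, n_obs)
```

Comparing the density of `y_prior_pred` to the observed response reveals priors that are far too tight or too diffuse before any fitting is done.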
Evaluate the sampled posterior and perform inference
We sample the posterior using one or more Markov chains
Examine the traces (path of chains)
Traces and posterior density of MCMC traces
Example from Vehtari et al., 2021, a seminal paper on the evaluation of MCMC
We must verify that the samples are representative
Table of MCMC convergence statistics
A lot of information in this summary table, including the Monte Carlo standard error (mean and sd) of the estimates of the posterior distribution
We must verify that the samples are representative
MCSE by quantile for each model parameter
Ideally want MCSE to be small and uniform with quantile
Notice the MCSE is higher for outer quantiles
ESS locally and by quantile for each model parameter
Ideally want ESS to be large and uniform with quantile
Local ESS is fairly uniform with quantile
Notice the quantile ESS is lower for outer quantiles
HDIs of model parameters by trace
HDIs are similar for each parameter by trace
HDIs confirm inferences on uncertainty of the model parameters
Posterior distribution of parameters with HDIs
We must check that predictions from the model are reasonable
Does the posterior distribution look like the distribution of the observed response?
Can simply compare density plots
Apply a test statistic \(T\)
Base statistic on posterior predictive distribution
Approximate PPD by resampling from the posterior distribution:
\[p(y^{rep}|y)= \sum_i p(y^{rep}|\theta_i)\, p(\theta_i|y)\]
Where:
\(y^{rep} =\) realization drawn from the posterior distribution
\(y =\) observation
\(p(y^{rep}|\theta) =\) draw from the posterior distribution with parameters \(\theta\)
\(p(\theta|y) =\) posterior distribution of model parameters, \(\theta\)
Posterior predictive distribution vs. observed responses
Agreement is generally good between posterior predictive and actual observations
Average predictive is heavier in the tails
\[p = p(y^{rep} < y | y )\]
Distribution of Bayesian p-values
Plot is centered on 0.5
Distribution is symmetric
\[p_i = p(y_i^{rep} < y_i | y )\]
Distribution of Bayesian u-values
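A sketch of computing these p-values and u-values from posterior predictive draws. The arrays `y` and `y_rep` are simulated stand-ins, drawn here from the same distribution to mimic a well-calibrated model:

```python
import numpy as np

# y_rep: (n_draws, n_obs) replicated data; y: the observations.
rng = np.random.default_rng(0)
n_draws, n_obs = 2000, 100
y = rng.normal(0.0, 1.0, size=n_obs)                 # stand-in observations
y_rep = rng.normal(0.0, 1.0, size=(n_draws, n_obs))  # stand-in replicates

# Overall Bayesian p-value using a test statistic T (here the mean):
# the fraction of replicated draws with T(y_rep) below T(y)
p_value = np.mean(y_rep.mean(axis=1) < y.mean())

# Per-observation u-values, p_i = p(y_i^rep < y_i | y); for a well
# calibrated model these are approximately uniform on [0, 1]
u_values = np.mean(y_rep < y, axis=0)
```

Strongly skewed u-values, or a p-value near 0 or 1, signal a systematic mismatch between the model and the observations.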
Analysis of change point in British coal mine disasters
Toward the end of the 19th Century there was growing public awareness of the rising coal mine deaths
A mine disaster is defined as an incident leading to 6 or more deaths
As early as 1886, Royal Commission published recommendations for safety practices
Famous dataset from Kaggle, and other sources
Goal to create a model to quantify the differences in the rate of mine disasters before and after the introduction of safety practices
Example of a point process model
Point process model has an intensity or rate of occurrence of events, e.g. mine disasters
The number of events over a time period for this type of point process is iid
iid nature of the point process does not mean the distribution of the process cannot change in time
Change points or switch points in time are quite common in the real-world
Number of disasters per year 1851-1962
\[ D_t \sim Pois(r_t),\quad r_t = \begin{cases} e, & t < s\\ l, & t \ge s \end{cases} \]
\[\begin{align} s &\sim unif(t_{min}, t_{max})\\ e &\sim exp(2) \\ l &\sim exp(2) \end{align}\]
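A sketch of simulating from this generative model in plain numpy (note the \(exp(2)\) prior has rate parameter 2, i.e. scale 0.5, matching the PyMC3 parameterization):

```python
import numpy as np

# Generative sketch of the change-point model: draw the switch year s
# uniformly, the early and late rates e and l from exp(2) priors,
# then Poisson disaster counts per year 1851-1962.
rng = np.random.default_rng(0)
years = np.arange(1851, 1963)                      # 112 years inclusive
s = rng.integers(years.min(), years.max() + 1)     # uniform switch point
early, late = rng.exponential(scale=0.5, size=2)   # rate-2 exponentials
rate = np.where(years < s, early, late)            # piecewise Poisson rate
counts = rng.poisson(rate)                         # simulated disasters
```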
with pymc3.Model() as disaster_model:
    # Uniform prior on the switch point
    switchpoint = pymc3.DiscreteUniform("switchpoint", lower=disaster_data.Year.min(), upper=disaster_data.Year.max())
    # More informative prior based on domain knowledge
    # switchpoint = pymc3.Triangular("switchpoint", lower=disaster_data.Year.min(), upper=disaster_data.Year.max(), c=1900)
    # Priors for pre- and post-switch rates
    early_rate = pymc3.Exponential("early_rate", 2.0)
    late_rate = pymc3.Exponential("late_rate", 2.0)
    # Poisson rate switches for years before and after the switch point
    rate = pymc3.math.switch(switchpoint >= disaster_data.Year, early_rate, late_rate)
    disasters = pymc3.Poisson("disasters", rate, observed=disaster_data.Count)
HDI of parameters by trace
The HDIs are consistent across traces
See notebook for MCMC, Prior and Posterior checks
Posterior distribution of parameters with HDI
HDI of switch point is in reasonable range
Significant difference in HDI of rates
Posterior predictive checks
For complex Bayesian models we need a computationally efficient approximation
Grid sampling is inefficient
MCMC sampling is based on Markov chains
Several MCMC sampling methods have been developed
NUTS is the state of the art MCMC algorithm
Performance metrics of MCMC sampling